In B1700 you started to learn the basics of R, and in the previous practical for B1701 you learned how to load multiple files at once. However, there are occasions when your data is not stored in flat files and you may want to pull data from an online database or website. How to do this without predefined R packages is beyond the aims of this course, but many R packages are available that can help you pull data from the web. Examples are worldfootballR, baseballr, hoopR, and SwimmeR. All these packages come with instructions on how to use them to pull relevant data from a variety of sources, and it is worth having a look at some of them.
For this practical, however, we will use the StatsBombR package. StatsBomb is a sports analytics company that specializes in providing data and insights related to football. The company focuses on collecting, analyzing, and delivering detailed statistical information about football matches and players. Most of their data is behind a paywall, but they do offer a range of datasets for free, and that is the data we will be exploring.
To start, install the devtools, R.utils, SDMTools, and StatsBombR packages using the following code:
# Install packages
install.packages("devtools")
install.packages("R.utils")
# Replace YOURFILELOCATION with the folder that holds the downloaded SDMTools archive
install.packages("YOURFILELOCATION/SDMTools_1.1-221.2.tar.gz", repos=NULL, type="source", dependencies=TRUE)
devtools::install_github("statsbomb/StatsBombR")
Next we need to load these packages.
# Load packages
library(StatsBombR)
library(tidyverse)
Once you have successfully installed and loaded all the necessary packages, you can begin reading your data.
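If one of the library() calls fails, a package is probably still missing. A minimal sketch of a check using base R's requireNamespace() is shown below; the vector name pkgs and the choice of packages to list are assumptions made for illustration.

```r
# Packages this practical relies on (an assumed list; adjust as needed)
pkgs <- c("devtools", "R.utils", "SDMTools", "StatsBombR", "tidyverse")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)
```

An empty result means every listed package is available.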
StatsBombR uses several functions to load data into R:
FreeCompetitions() shows all the competition data that is available for free.
FreeMatches() shows the available matches within a competition.
StatsBombFreeEvents() shows all the event data for all specified matches.
A useful guide about loading in StatsBomb data using R is provided here.
Load all free competitions data and assign it to a tibble named CompDF:
CompDF <- FreeCompetitions()
Looking at the CompDF data frame, we see that data is available for 71 competitions. For this practical we are interested in the Men’s European Championship (MEC), so we need to create a data frame that contains only that competition. We could scroll through the table to find the competition ID and filter on that ID, but with a large data set this may be difficult. Instead, we can filter on a few variables we know relate to the MEC. We will use competition_gender (male), season_name (2020), and competition_international (TRUE) as filters.
MEC_DF <- CompDF %>%
  filter(competition_gender=="male" & season_name==2020 & competition_international==TRUE)
print(MEC_DF)
Now that we have created a separate table for just the MEC, we can use it to load all the match data using FreeMatches().
MatchesDF <- FreeMatches(MEC_DF)
You do not need to create an intermediate variable if you are confident that the filter selects the correct data. You could embed the filter in the FreeMatches() call using a pipeline as follows:
MatchesDF <- CompDF %>%
  filter(competition_gender=="male" & season_name==2020 & competition_international==TRUE) %>%
  FreeMatches()
We have now loaded all match information for the Men’s European Championship, but what we are really after is event data. Event data gives us the opportunity to analyse the performance of individual players and teams.
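Because the piped version never prints the filtered competition table, it can help to confirm that the filter matches exactly one competition before passing it on. The sketch below runs on a hypothetical stand-in table (comp_demo, with made-up rows) that borrows the column names used in the filter; it assumes season_name is stored as text.

```r
library(dplyr)

# Hypothetical stand-in for CompDF; the rows are invented for illustration
comp_demo <- tibble::tibble(
  competition_name          = c("Competition A", "Competition B", "Competition C"),
  competition_gender        = c("male", "female", "male"),
  season_name               = c("2020", "2018", "2020/2021"),
  competition_international = c(TRUE, FALSE, FALSE)
)

# Same three conditions as in the practical
selected <- comp_demo %>%
  filter(competition_gender == "male" &
           season_name == "2020" &
           competition_international == TRUE)

# Stop early if the filter matched zero or several competitions
stopifnot(nrow(selected) == 1)
```

The same stopifnot() line could sit between filter() and FreeMatches() in a real script, at the cost of reintroducing an intermediate variable.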
We will use free_allevents() to load in all event data for all the matches played during the MEC.
ECDataDF <- free_allevents(MatchesDF)
The last step in loading StatsBomb data is to apply the allclean() function. This is not just a cleaning operation: the function also creates some additional variables that may come in useful later on (e.g. location data split into x and y coordinates).
ECDataDF <- as_tibble(allclean(ECDataDF))
Once we have finished loading our data, we would like to save the data set as an RDS file.
# Replace this path with a location on your own machine
saveRDS(ECDataDF, file="C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/ECData.rds")
Saving the data as a .csv file instead would first require removing any list-type columns. The select_if() function can select columns from the ECDataDF data frame based on a condition. The condition here is is.list, which checks whether a column is of list type. The result is a new data frame, list_vars, containing only the list-type columns. As we are only interested in the column names, list_vars <- c(names(list_vars)) then stores those names as a character vector. Last, the list_vars vector is compared with the column names of ECDataDF, and only the columns that do not match (!) a name in list_vars are kept. In other words, this removes the columns identified earlier as having list-type values. The result can then be saved as a .csv file.
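The column-removal steps described above can be sketched on a small stand-in data set. The table events, its values, and the temporary output path are hypothetical and used only for illustration; this is one way to write the steps, not necessarily the exact code used for the practical.

```r
library(dplyr)

# Hypothetical stand-in for the event data: two plain columns, one list column
events <- tibble::tibble(
  id       = 1:3,
  type     = c("Pass", "Shot", "Pass"),
  location = list(c(60, 40), c(108, 36), c(75, 20))
)

# Step 1: keep only the list-type columns
list_vars <- events %>% select_if(is.list)

# Step 2: store just their names as a character vector
list_vars <- c(names(list_vars))

# Step 3: keep the columns whose names are NOT in list_vars
events_flat <- events[, !(names(events) %in% list_vars)]

# Step 4: write the flattened table to a CSV file (a temporary file here)
csv_path <- tempfile(fileext = ".csv")
write.csv(events_flat, csv_path, row.names = FALSE)
```

Note that the information held in the dropped list columns is lost in the CSV version.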
If we do not want to delete the variables that are structured as lists, we can save the dataset as an RDS file (R Data Store) instead. To do this we use the saveRDS() function in much the same way as write.csv(). To open an RDS file we use readRDS(). In summary, while CSV and Excel formats are easy to use outside of R, they are not the best choice for preserving list variable structures; the RDS format is recommended for maintaining the integrity of data frames with list variables.
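A short round trip illustrates the point about list variables surviving in an RDS file. The data frame events_demo and the temporary file path are hypothetical; only base R is used.

```r
# Hypothetical data frame with one list column
events_demo <- data.frame(id = 1:2)
events_demo$location <- list(c(60, 40), c(108, 36))

# Save to a temporary .rds file and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(events_demo, rds_path)
restored <- readRDS(rds_path)

# The list column is still a list after reloading
print(is.list(restored$location))
print(restored$location[[2]])
```

readRDS() returns the object rather than restoring it under its old name, so the result has to be assigned to a variable.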
Exercise 1: Make sure StatsBombR and tidyverse are installed and loaded.
# Load packages
library(StatsBombR)
library(tidyverse)
Exercise 2: Load all event data for the 2018 National Women’s Soccer League.
CompDF <- (FreeCompetitions())
NWSLDF <- CompDF %>%
  filter(competition_gender=="female" & season_name==2018 & competition_international==FALSE)
print(NWSLDF)
MatchesDF <- (FreeMatches(NWSLDF))
NWSLDataDF <- free_allevents(MatchesDF)
NWSLDataDF <- as_tibble(allclean(NWSLDataDF))
Exercise 3: Save your data file using saveRDS().
saveRDS(NWSLDataDF, "C:/Users/wkb14101/OneDrive - University of Strathclyde/MSc SDA/R Projects/B1701/data/NWSLData.rds")